7 research outputs found

    STRUCTURAL AND LEXICAL METHODS FOR AUDITING BIOMEDICAL TERMINOLOGIES

    Get PDF
    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications including information extraction and retrieval, data integration and management, and decision support. Quality issues of biomedical terminologies, if not addressed, could affect all downstream applications that use them as knowledge sources. Therefore, Terminology Quality Assurance (TQA) has become an integral part of the terminology management lifecycle. However, identification of potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. It is time-consuming and labor-intensive to manually audit them and hence, automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods to audit biomedical terminologies utilizing their structural as well as lexical information are proposed. Two inference-based methods, two non-lattice-based methods and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. In the first inference-based method, the GO concept names are represented using set-of-words model and sequence-of-words model, respectively. Inconsistencies derived between hierarchical linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts’ evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model achieves a precision of 57.55% (122 out of 212) in identifying inconsistencies. In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term-algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts’ evaluation of randomly selected 210 potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively. In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues), are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns: containment, union, intersection, union-intersection, inference-contradiction, and inference-union are leveraged. Each pattern indicates a potential specific type of error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all the ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept is a subset of that of the other, a potentially missing is-a relation between the two concepts is suggested. This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt with a precision of 84.44% (76 out of 90) for inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) for inheriting from all the ancestors. For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is introduced to identify similar NLSs to avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is computed based on semantic similarity between their corresponding concept names. To compute the similarity between concept names, the concept names are converted to vectors using the Doc2Vec document embedding model and then the cosine similarity of the two vectors is computed. All the structurally identical NLSs with a similarity score above 0.85 is considered to be similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS. Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT. Concept pairs exhibiting a containment pattern is the focus here. The problem is framed as a binary classification task, where given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation or not. Positive training samples are existing is-a relations in the terminology exhibiting containment pattern. Negative training samples are concept-pairs without is-a relations that are also exhibiting containment pattern. A graph neural network model is constructed for this task and trained with subgraphs generated enclosing the pairs of concepts in the samples. To evaluate each model trained by the two terminologies, two evaluation sets are created considering newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6. The model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64 and an F1 score of 0.56

    Query-Constraint-Based Mining of Association Rules for Exploratory Analysis of Clinical Datasets in the National Sleep Research Resource

    Get PDF
    Background: Association Rule Mining (ARM) has been widely used by biomedical researchers to perform exploratory data analysis and uncover potential relationships among variables in biomedical datasets. However, when biomedical datasets are high-dimensional, performing ARM on such datasets will yield a large number of rules, many of which may be uninteresting. Especially for imbalanced datasets, performing ARM directly would result in uninteresting rules that are dominated by certain variables that capture general characteristics. Methods: We introduce a query-constraint-based ARM (QARM) approach for exploratory analysis of multiple, diverse clinical datasets in the National Sleep Research Resource (NSRR). QARM enables rule mining on a subset of data items satisfying a query constraint. We first perform a series of data-preprocessing steps including variable selection, merging semantically similar variables, combining multiple-visit data, and data transformation. We use Top-k Non-Redundant (TNR) ARM algorithm to generate association rules. Then we remove general and subsumed rules so that unique and non-redundant rules are resulted for a particular query constraint. Results: Applying QARM on five datasets from NSRR obtained a total of 2517 association rules with a minimum confidence of 60% (using top 100 rules for each query constraint). The results show that merging similar variables could avoid uninteresting rules. Also, removing general and subsumed rules resulted in a more concise and interesting set of rules. Conclusions: QARM shows the potential to support exploratory analysis of large biomedical datasets. It is also shown as a useful method to reduce the number of uninteresting association rules generated from imbalanced datasets. A preliminary literature-based analysis showed that some association rules have supporting evidence from biomedical literature, while others without literature-based evidence may serve as the candidates for new hypotheses to explore and investigate. Together with literature-based evidence, the association rules mined over the NSRR clinical datasets may be used to support clinical decisions for sleep-related problems

    Additional file 1 of Query-constraint-based mining of association rules for exploratory analysis of clinical datasets in the National Sleep Research Resource

    No full text
    Results obtained: Results.zip contains the results obtained by merged and unmerged methods for different query constraints across the five datasets in NSRR. (ZIP 199 kb

    Identification of missing hierarchical relations in the vaccine ontology using acquired term pairs

    Full text link
    Abstract Background The Vaccine Ontology (VO) is a biomedical ontology that standardizes vaccine annotation. Errors in VO will affect a multitude of applications that it is being used in. Quality assurance of VO is imperative to ensure that it provides accurate domain knowledge to these downstream tasks. Manual review to identify and fix quality issues (such as missing hierarchical is-a relations) is challenging given the complexity of the ontology. Automated approaches are highly desirable to facilitate the quality assurance of VO. Methods We developed an automated lexical approach that identifies potentially missing is-a relations in VO. First, we construct two types of VO concept-pairs: (1) linked; and (2) unlinked. Each concept-pair further derives an Acquired Term Pair (ATP) based on their lexical features. If the same ATP is obtained by a linked concept-pair and an unlinked concept-pair, this is considered to indicate a potentially missing is-a relation between the unlinked pair of concepts. Results Applying this approach on the 1.1.192 version of VO, we were able to identify 232 potentially missing is-a relations. A manual review by a VO domain expert on a random sample of 70 potentially missing is-a relations revealed that 65 of the cases were valid missing is-a relations in VO (a precision of 92.86%). Conclusions The results indicate that our approach is highly effective in identifying missing is-a relation in VO.http://deepblue.lib.umich.edu/bitstream/2027.42/174102/1/13326_2022_Article_276.pd

    Query-constraint-based mining of association rules for exploratory analysis of clinical datasets in the National Sleep Research Resource

    Get PDF
    Abstract Background Association Rule Mining (ARM) has been widely used by biomedical researchers to perform exploratory data analysis and uncover potential relationships among variables in biomedical datasets. However, when biomedical datasets are high-dimensional, performing ARM on such datasets will yield a large number of rules, many of which may be uninteresting. Especially for imbalanced datasets, performing ARM directly would result in uninteresting rules that are dominated by certain variables that capture general characteristics. Methods We introduce a query-constraint-based ARM (QARM) approach for exploratory analysis of multiple, diverse clinical datasets in the National Sleep Research Resource (NSRR). QARM enables rule mining on a subset of data items satisfying a query constraint. We first perform a series of data-preprocessing steps including variable selection, merging semantically similar variables, combining multiple-visit data, and data transformation. We use Top-k Non-Redundant (TNR) ARM algorithm to generate association rules. Then we remove general and subsumed rules so that unique and non-redundant rules are resulted for a particular query constraint. Results Applying QARM on five datasets from NSRR obtained a total of 2517 association rules with a minimum confidence of 60% (using top 100 rules for each query constraint). The results show that merging similar variables could avoid uninteresting rules. Also, removing general and subsumed rules resulted in a more concise and interesting set of rules. Conclusions QARM shows the potential to support exploratory analysis of large biomedical datasets. It is also shown as a useful method to reduce the number of uninteresting association rules generated from imbalanced datasets. A preliminary literature-based analysis showed that some association rules have supporting evidence from biomedical literature, while others without literature-based evidence may serve as the candidates for new hypotheses to explore and investigate. Together with literature-based evidence, the association rules mined over the NSRR clinical datasets may be used to support clinical decisions for sleep-related problems
    corecore